This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
#import statements
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb
import pandas_profiling
from pandas_profiling import ProfileReport
%matplotlib inline
# Load the data set
df = pd.read_csv('tmdb-movies.csv')
df.head(3)
df.info()
#dropping columns which are not needed!
#these are categorical or id columns
drop_col = ['id','imdb_id','original_title','cast','homepage','director','tagline','keywords','overview',
'production_companies','release_date']
df.drop(drop_col,axis=1,inplace=True)
df.shape
df.isna().sum(axis=0)
# number of rows that contain at least one missing value
df.isna().any(axis=1).sum()
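# fill missing values with each column's mode; mode() must be called,
# and .iloc[0] selects the first mode of every column
df.fillna(df.mode().iloc[0],inplace=True)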
df.info()
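Filling with the mode relies on `DataFrame.mode()` returning a frame rather than a single row, so the first row has to be selected before it is passed to `fillna`. A minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not taken from the TMDb file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data (values are made up).
toy = pd.DataFrame({
    "genres": ["Drama", None, "Drama", "Comedy"],
    "runtime": [90.0, 120.0, np.nan, 90.0],
})

# mode() returns a DataFrame (ties would add extra rows);
# .iloc[0] takes the first mode of every column as a Series.
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value under its own name.
filled = toy.fillna(modes)
print(filled)
```

Whether the mode is a sensible fill value depends on the column; for a numeric column such as `runtime` the median may be a more natural choice.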
report = ProfileReport(df,explorative=True)
report
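# 'genres' contains multiple values separated by pipe (|) characters
# split them, one-hot encode each value, then sum back to one row per movie
# (groupby(level=0).sum() replaces sum(level=0), which newer pandas versions no longer accept)
genre_df = pd.get_dummies(df['genres'].str.split('|', expand=True).stack()).groupby(level=0).sum()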
genre_df
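The same kind of indicator table can also be built with `Series.str.get_dummies`, which splits on a separator and encodes in one step. A small sketch on a hypothetical three-row series (the sample strings are made up):

```python
import pandas as pd

# Hypothetical pipe-separated genre strings, one entry per movie.
toy = pd.Series(["Action|Thriller", "Drama", "Comedy|Drama|Romance"], name="genres")

# One 0/1 column per distinct genre; a movie can have several 1s.
indicators = toy.str.get_dummies(sep="|")
print(indicators)
```

Each row sums to the number of genres listed for that movie, which is why per-genre percentages computed from such a table add up to more than 100.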
#joining dataframes
df = df.join(genre_df)
df
def help_ploting(df,highlight=12):
    """Plot each column's share as a sorted horizontal bar chart.

    Args:
        df: DataFrame of 0/1 indicator columns (one column per genre).
        highlight: Position, after sorting, of the bar to highlight.

    Returns:
        None
    """
    # percentage of rows in which each column is set
    total = []
    for col in df.columns:
        total.append(np.round(df[col].sum()/df[col].shape[0]*100,2))
    # rescale by a hard-coded divisor (presumably the sum of the percentages above)
    total = np.array(total)/313.96*100
    df_new = pd.DataFrame({
        'lab':df.columns,
        'val':total
    })
    df_c = df_new.sort_values('val')
    y = np.round(np.array(df_c['val']),2)
    fig = plt.gcf()
    fig.set_size_inches(18.5, 10.5)
    bar = plt.barh(df_c['lab'],df_c['val'],color = 'aqua')
    bar[highlight].set_color('dodgerblue')
    plt.rcParams["font.family"] = "serif"
    plt.rcParams["font.size"] = 20
    plt.xticks(fontsize=25)
    # write the value next to every bar
    for index, value in enumerate(y):
        plt.text(value, index-0.35, str(value))
help_ploting(genre_df,15)
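The numbers behind the bar chart can be sketched without a loop: the mean of a 0/1 column is the fraction of movies with that genre, and dividing by the sum of those fractions turns them into shares that add up to 100. The toy table below is made up, and normalising by the computed sum (instead of a fixed constant) is an assumption about the intent:

```python
import pandas as pd

# Made-up indicator table: 4 movies, 3 genres.
toy = pd.DataFrame({
    "Drama":  [1, 1, 0, 1],
    "Comedy": [0, 1, 1, 0],
    "Action": [0, 0, 1, 0],
})

# Percentage of movies tagged with each genre.
pct_of_movies = toy.mean() * 100

# Rescale so that the shares across all genres sum to 100.
share = pct_of_movies / pct_of_movies.sum() * 100

# Ascending order matches the bottom-to-top order of a barh plot.
share_sorted = share.sort_values()
print(share_sorted.round(2))
```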
# number of movies per genre (rows) and release year (columns)
pv = pd.pivot_table(df, columns='release_year',
                    values=['Action', 'Comedy','Drama', 'Romance','Thriller'], aggfunc='sum')
fig, ax = plt.subplots(figsize=(13, 8))
# one line per genre, with release year on the x-axis
for col in pv.T.columns:
    plt.plot(pv.T.index.values,pv.T[col].values,label=col)
ax.legend()
ax.margins(y=.1, x=.1)
plt.grid(True)
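The pivot above amounts to counting, for every release year, how many movies carry each genre flag. A hedged sketch of the same aggregation with `groupby`, on a made-up five-row frame:

```python
import pandas as pd

# Made-up data: a release year plus 0/1 genre flags per movie.
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2001],
    "Drama":        [1, 0, 1, 1, 0],
    "Comedy":       [0, 1, 0, 1, 1],
})

# Rows are years, columns are genres, cells are movie counts.
per_year = toy.groupby("release_year")[["Drama", "Comedy"]].sum()
print(per_year)
```

This layout (years as the index) is already the shape needed for a line plot, so no transpose is required.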
df['revenue_adj'].describe()
df.columns
subdf = df[['popularity','budget_adj','revenue_adj','vote_average','runtime']]
subdf
pd.plotting.scatter_matrix(subdf,figsize=(12,10));
report2 = ProfileReport(subdf,explorative=True)
report2
# recent seaborn versions require x and y to be passed as keywords
sb.regplot(data=subdf,x='revenue_adj',y='runtime')
sb.regplot(data=subdf,x='revenue_adj',y='popularity')
sb.regplot(data=subdf,x='revenue_adj',y='budget_adj')
sb.regplot(data=subdf,x='revenue_adj',y='vote_average')
For the popular genres, we have seen that five genres (['Action', 'Comedy', 'Drama', 'Romance', 'Thriller']) are the most popular from 1960 to 2015.
From 1980 onwards, the number of movies in these genres clearly rises very sharply. The commercial public screening of ten of the Lumière brothers' short films in Paris in 1895 can be regarded as the breakthrough of projected cinematographic motion pictures, but the steep growth in this data only starts around 1980.
A likely explanation is that more and more people started watching movies from 1980 onwards, so more movies were produced.
From 1980 to 2000 there is competition between Thriller, Action and Romance, but after 2000 there is a clear winner among the three: Thriller.
Ranking of genres by popularity:
1. Drama
2. Comedy
3. Thriller
4. Action
5. Romance
For high-revenue movies, I compared revenue_adj with [budget_adj, popularity, vote_average, runtime].
Ranking of the characteristics of high-revenue movies:
1. vote average
2. budget
3. popularity
4. runtime
The higher a movie's budget and vote average, the higher its revenue tends to be.
Runtime and popularity also have a positive effect, but they don't influence revenue as much.
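One way to check a ranking like this numerically is to compute each feature's Pearson correlation with `revenue_adj` and sort the result. The sketch below uses a made-up five-row frame, so its numbers say nothing about the TMDb data; only the column names are borrowed from the analysis above:

```python
import pandas as pd

# Made-up values, purely to show the mechanics.
toy = pd.DataFrame({
    "revenue_adj": [1.0, 2.0, 3.0, 4.0, 5.0],
    "budget_adj":  [1.0, 2.0, 3.0, 4.0, 6.0],
    "runtime":     [100, 90, 120, 95, 110],
})

# Correlation of every other column with revenue, strongest first.
corr = (
    toy.corr()["revenue_adj"]
    .drop("revenue_adj")
    .sort_values(ascending=False)
)
print(corr)
```

Pearson correlation only captures linear association, so it complements the regression plots rather than replacing them.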